A Bayesian criterion for cluster stability
Authors
Abstract
We present a technique for evaluating and comparing how clusterings reveal structure inherent in the data set. Our technique is based on a criterion evaluating how much point-to-cluster distances may be perturbed without affecting the membership of the points. Although similar to some existing perturbation methods, our approach distinguishes itself in five ways. First, the strength of the perturbations is indexed by a prior distribution controlling how close to boundary regions a point may be before it is considered unstable. Second, our approach is exact in that we integrate over all the perturbations; in practice, this can be done efficiently for well-chosen prior distributions. Third, we provide a rigorous theoretical treatment of the approach, showing that it is consistent for estimating the correct number of clusters. Fourth, it yields a detailed picture of the behavior and structure of the clustering. Finally, it is computationally tractable and easy to use, requiring only a point-to-cluster distance matrix as input. In a simulation study, we show that it outperforms several existing methods in terms of recovering the correct number of clusters. We also illustrate the technique in three real data sets. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013
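The perturbation idea in the abstract can be illustrated with a rough Monte Carlo sketch. This is not the authors' method: the abstract says they integrate exactly over perturbations under a prior distribution, whereas the sketch below samples them. The additive Gaussian noise, the fixed `scale` parameter, the function name `membership_stability`, and the toy distance matrix are all assumptions made for illustration. Only the input format, a point-to-cluster distance matrix, comes from the abstract.

```python
import random


def membership_stability(distances, scale=0.1, n_draws=2000, seed=0):
    """Estimate, per point, how often its nearest cluster stays the same
    when every point-to-cluster distance is perturbed.

    distances: list of rows, one row per point, one column per cluster.
    scale: standard deviation of the additive Gaussian noise (assumed;
           stands in for the prior-controlled perturbation strength).
    """
    rng = random.Random(seed)
    scores = []
    for row in distances:
        clusters = range(len(row))
        # Unperturbed assignment: the cluster with the smallest distance.
        base = min(clusters, key=row.__getitem__)
        kept = 0
        for _ in range(n_draws):
            noisy = [d + rng.gauss(0.0, scale) for d in row]
            if min(clusters, key=noisy.__getitem__) == base:
                kept += 1
        scores.append(kept / n_draws)
    return scores


# Hypothetical toy input: three points, two clusters.
distances = [
    [0.1, 2.0],   # well inside cluster 0
    [1.0, 1.05],  # close to the boundary between the clusters
    [3.0, 0.2],   # well inside cluster 1
]
scores = membership_stability(distances, scale=0.2)
print(scores)
```

Points far from a cluster boundary should score near 1, while the boundary point scores much lower, which is the kind of per-point picture of stability the abstract describes.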
Similar resources
Bayesian Optimum Design Criterion for Multi Models Discrimination
This paper considers the problem of obtaining an optimum design that can discriminate between several rival models. We give an optimality criterion using a Bayesian approach. This is an extension of Bayesian KL-optimality to more than two models. A modification is made to deal with nested models. The proposed Bayesian optimality criterion is a weighted average, where...
An Efficient Bayesian Optimal Design for Logistic Model
Consider a Bayesian optimal design with many support points, which poses the problem of collecting data with only a small number of observations at each design point. Under such a scenario, the asymptotic property of using the Fisher information matrix to approximate the covariance matrix of posterior ML estimators might be doubtful. We suggest using the Bhattacharyya matrix in deriving the information matri...
Bayesian inference for multiband image segmentation via model-based cluster trees
We consider the problem of multiband image clustering and segmentation. We propose a new methodology for doing this, called model-based cluster trees. This is grounded in model-based clustering, which bases inference on finite mixture models estimated by maximum likelihood using the EM algorithm, and automatically chooses the number of clusters by Bayesian model selection, approximated using BIC...
Investigation on Several Model Selection Criteria for Determining the Number of Cluster
Determining the number of clusters is a crucial problem in clustering. Conventionally, the number of clusters was selected via cost-function-based criteria such as Akaike’s information criterion (AIC), the consistent Akaike’s information criterion (CAIC), and the minimum description length (MDL) criterion, which formally coincides with the Bayesian inference criterion (BIC). In...
Stability evaluation of Neural and statistical Classifiers based on Modified Semi-bounded Plug-in Algorithm
This paper illustrates a new criterion for evaluating neural networks stability compared to the Bayesian classifier. The stability comparison is performed by the error rate probability densities estimation using the modified semi-bounded Plug-in algorithm. We attempt, in this work, to demonstrate that the Bayesian approach for neural networks improves the performance and stability degree of the...
Speaker Clustering Based on Bayesian Information Criterion
This paper presents an effective method for clustering unknown speech utterances based on their associated speakers. The proposed method jointly optimizes the generated clusters and the number of clusters according to a Bayesian information criterion (BIC). The criterion assesses a partitioning of utterances based on how high the level of within-cluster homogeneity can be achieved at the expense...
Journal title:
- Statistical Analysis and Data Mining
Volume 6, Issue
Pages -
Publication date 2013